Chicago is a multicultural city that offers many opportunities to its inhabitants. However, not all communities in the city share the same conditions, and opportunities differ for people living in different areas. The Chicago government publishes a hardship index, a multidimensional measure of community socioeconomic conditions: the higher the index, the worse the economic and social conditions.
Using information provided by Foursquare, we want to assess which kinds of venues are present in communities with different hardship indexes. If there are differences, we can use this information to propose building specific types of venues to improve community conditions.
We also aim to verify whether the venue composition of a neighborhood is a good predictor of the hardship index; this would let us quickly assess the conditions of different locations without surveying the population directly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from geopy.geocoders import ArcGIS
import requests
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import chisquare
We are going to use the Chicago Hardship Index, based on census data from 2008 to 2012. From this dataset we only need two columns: the Community Area Name and the Hardship Index.
Alongside this dataset we are going to query the Foursquare database to check which venues are most popular in each community area; from those results we will use the Category column.
Lastly, to plot our findings on a map, we are going to use the shapefiles for the community area boundaries provided by the Chicago city government, available at: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6
The Chicago Hardship Index dataset is publicly available at https://data.cityofchicago.org/Health-Human-Services/hardship-index/792q-4jtu and consists of 78 rows by 7 columns: one row per community area, plus one extra row for the Chicago city average. The columns are:
- Hardship Index
- Community Area Name
- Percent of Households Below Poverty
- Percent of people aged 25+ without a Highschool Diploma
- Percent of people aged 16+ unemployed
- Percent of people aged below 18 or over 64
- Per Capita Income
For the scope of this exercise we are only going to use the 'Hardship Index' and 'Community Area Name' columns.
#Get the geojson data for the communities boundaries
geojson = r'Data/Boundaries - Community Areas (current).geojson'
#Get the Chicago Hardship Index
chicago_h_i = pd.read_csv('Data/hardship_index.csv')
# Let's take a peek at the Chicago Hardship Index dataframe
chicago_h_i.head()
# The city-wide 'CHICAGO' average row has no Hardship Index value
chicago_h_i[chicago_h_i['COMMUNITY AREA NAME'] == 'CHICAGO']['HARDSHIP INDEX']
# Drop it (and any other rows with missing values)
chicago_h_i.dropna(inplace=True)
# First let's keep only the rows we're going to use
chicago_h_i = chicago_h_i.loc[:,['HARDSHIP INDEX', 'COMMUNITY AREA NAME']]
#The community area names in the geojson file are in uppercase, so let's uppercase the names in our dataset
chicago_h_i['COMMUNITY AREA NAME'] = chicago_h_i['COMMUNITY AREA NAME'].str.upper()
#And our dataset looks like this
chicago_h_i
# Three community names don't match the geojson spellings (O'Hare, Montclare and Washington Heights), so let's fix them
chicago_h_i.loc[75,'COMMUNITY AREA NAME'] = 'OHARE'
chicago_h_i.loc[72,'COMMUNITY AREA NAME'] = 'WASHINGTON HEIGHTS'
chicago_h_i.loc[17,'COMMUNITY AREA NAME'] = 'MONTCLARE'
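As an aside, positional `.loc` indices are brittle if the CSV row order ever changes; a name-based mapping is safer. A minimal sketch on a toy dataframe (the source spellings in the mapping are assumptions inferred from the comment above, so verify them against your actual CSV):

```python
import pandas as pd

# Hypothetical source spellings mapped to the geojson spellings
name_fixes = {
    "O'HARE": 'OHARE',
    'MONTCLAIRE': 'MONTCLARE',
    'WASHINGTON HEIGHT': 'WASHINGTON HEIGHTS',
}

# Toy stand-in for the hardship dataframe
df = pd.DataFrame({'COMMUNITY AREA NAME': ["O'HARE", 'MONTCLAIRE', 'LOOP']})
# replace() only touches names listed in the mapping; everything else passes through
df['COMMUNITY AREA NAME'] = df['COMMUNITY AREA NAME'].replace(name_fixes)
```

This way the fix survives reordering or re-downloading the CSV, at the cost of having to keep the mapping keys in sync with the source spellings.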
# Adding the coordinates of the city
latitude = 41.881832
longitude = -87.623177
# Creating a Chicago map
chicago_map = folium.Map(location=[latitude, longitude], zoom_start=10)
# And now let's make this map tell us where are the communities and what is their Hardship Index
folium.Choropleth(geo_data = geojson,
name = 'choropleth',
data = chicago_h_i,
columns= ['COMMUNITY AREA NAME','HARDSHIP INDEX'],
key_on= 'feature.properties.community',
fill_color='YlOrRd',
fill_opacity=0.5,
line_opacity=0.2,
legend_name='Hardship Index').add_to(chicago_map)
#Display the map
chicago_map
That map is looking good, but we can't tell what the communities' names are, so let's add a few markers.
To add the markers we use geopy's geocoders to get the coordinates of each community. I have used the ArcGIS geocoder because it is simple to use: it requires no username or API key.
user_agent = "chicago_com" # Let's define a user agent name
geolocator = ArcGIS(user_agent=user_agent) # Instantiate an ArcGIS geolocator
def get_ll(community): # Define a function that will retrieve the coordinates of a given community
    address = f'{community}, Chicago, Illinois' # Build the query string
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    return latitude, longitude
# Define a function that returns the latitude
def lat(location):
    return location[0]

# Define a function that returns the longitude
def long(location):
    return location[1]
# Geocode every community and store the result in a new column
chicago_h_i['location'] = chicago_h_i['COMMUNITY AREA NAME'].apply(get_ll)
# Split the location tuples into latitude and longitude columns
chicago_h_i['LATITUDE'] = chicago_h_i['location'].apply(lat)
chicago_h_i['LONGITUDE'] = chicago_h_i['location'].apply(long)
# Drop the location column as we now have LATITUDE and LONGITUDE columns
chicago_h_i.drop('location', inplace=True, axis=1)
# Alternatively, load previously geocoded coordinates from the CSV cache
coordinates = pd.read_csv("Data/coordinates.csv")
chicago_h_i[['LATITUDE','LONGITUDE']] = coordinates[['LATITUDE','LONGITUDE']].values
chicago_h_i
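For reference, a cache file like coordinates.csv can be produced by saving the geocoded columns after the first run, so later runs skip the geocoder entirely. A minimal round-trip sketch with made-up coordinates and a temporary file (the real notebook uses Data/coordinates.csv):

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the geocoded dataframe (coordinates are made up)
df = pd.DataFrame({'COMMUNITY AREA NAME': ['ROGERS PARK', 'LOOP'],
                   'LATITUDE': [42.01, 41.88],
                   'LONGITUDE': [-87.67, -87.63]})

# Write the cache once...
path = os.path.join(tempfile.mkdtemp(), 'coordinates.csv')
df[['LATITUDE', 'LONGITUDE']].to_csv(path, index=False)

# ...and later runs just read it back instead of geocoding again
cached = pd.read_csv(path)
```

Caching like this also keeps the analysis reproducible if the geocoding service changes its answers over time.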
#Lets add the markers
for lat, lng, label in zip(chicago_h_i['LATITUDE'], chicago_h_i['LONGITUDE'], chicago_h_i['COMMUNITY AREA NAME']):
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.6).add_to(chicago_map)
If you click a marker you'll get the community name
chicago_map
chicago_bin = chicago_h_i.copy(deep=True)
# Discretize the continuous Hardship Index into six ordered categories
bin_labels = ['Very Low', 'Low', 'Medium Low', 'Medium High', 'High', 'Very High']
chicago_bin['HARDSHIP INDEX'] = pd.cut(chicago_bin['HARDSHIP INDEX'], bins=6, labels=bin_labels)
chicago_bin.head()
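Note that pd.cut splits the value range into six equal-width intervals (not equal-sized groups; pd.qcut would do that) and attaches the ordered labels. A small sketch with made-up scores:

```python
import pandas as pd

labels = ['Very Low', 'Low', 'Medium Low', 'Medium High', 'High', 'Very High']
scores = pd.Series([0, 20, 40, 60, 80, 99])  # made-up hardship scores

# Each score lands in one of six equal-width bins over the observed range
binned = pd.cut(scores, bins=6, labels=labels)
```

Equal-width binning means sparsely populated categories are possible if the index values cluster; with pd.qcut each bin would instead hold roughly the same number of communities.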
import os
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
VERSION = '20180604'
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 50  # maximum number of venues returned per community
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Community',
                             'Community Latitude',
                             'Community Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
# Let's call our function
chicago_venues = getNearbyVenues(names=chicago_bin['COMMUNITY AREA NAME'],
                                 latitudes=chicago_bin['LATITUDE'],
                                 longitudes=chicago_bin['LONGITUDE'])
# Load cached results from a previous run to avoid repeated API calls
chicago_venues = pd.read_csv('Data/chicago_venues.csv', index_col=0)
chicago_venues.head()
# one hot encoding
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")
# add community column back to dataframe
chicago_onehot['COMMUNITY'] = chicago_venues['Community']
# move community column to the first column
fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]
chicago_onehot.head()
chicago_grouped = chicago_onehot.groupby('COMMUNITY').sum().reset_index()
chicago_grouped.head()
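To see why the get_dummies/groupby-sum pair yields per-community category counts, here is the same pattern on a made-up three-row venues table:

```python
import pandas as pd

# Toy venues table: community A has a park and a cafe, community B a park
venues = pd.DataFrame({'Community': ['A', 'A', 'B'],
                       'Venue Category': ['Park', 'Cafe', 'Park']})

# One column per category, a 1 where the row belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='')
onehot['COMMUNITY'] = venues['Community']

# Summing the indicators per community gives category counts
counts = onehot.groupby('COMMUNITY').sum()
```

So each row of chicago_grouped is a community's venue-category count vector, which is what the "most common venues" step below sorts.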
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 3
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['COMMUNITY']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
community_venues_sorted = pd.DataFrame(columns=columns)
community_venues_sorted['COMMUNITY'] = chicago_grouped['COMMUNITY']
for ind in np.arange(chicago_grouped.shape[0]):
    community_venues_sorted.iloc[ind, 1:] = return_most_common_venues(chicago_grouped.iloc[ind, :], num_top_venues)
community_venues_sorted.head()
# The communities are sorted alphabetically, so sort the hardship data the same way,
# and take .values so pandas doesn't re-align the series back to its original row order
hardship_index = chicago_bin.sort_values('COMMUNITY AREA NAME')['HARDSHIP INDEX'].values
community_venues_sorted['hardship_index'] = hardship_index
community_venues_sorted.head()
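One pandas subtlety worth noting here: assigning a Series to a DataFrame column aligns on the index labels, so a sorted Series silently snaps back to its original row order unless you strip the index (e.g. with `.values`). A minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})           # default index 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 0, 1])  # a permuted index

df['aligned'] = s            # matched on index labels, not position
df['positional'] = s.values  # raw array: values land in the order given
```

This is exactly the trap when pairing an alphabetically sorted hardship series with a dataframe that has a fresh 0..N index.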
# First we create a set to contain the category names
top_locations_set = set()
# Iterate through the columns to get the category names
for col in community_venues_sorted.columns[1:-1]:
for val in community_venues_sorted[col].values:
top_locations_set.add(val)
# How many different categories do we have in the top 3?
len(top_locations_set)
top_venues = chicago_grouped[chicago_grouped.columns.intersection(top_locations_set)].copy(deep=True)
top_venues.set_index(chicago_grouped['COMMUNITY'],inplace=True)
top_venues.head()
### Add the categorical hardship index to the dataframe
# top_venues is indexed by community in alphabetical order, so sort before taking the values
top_venues['hardship_index'] = chicago_bin.sort_values('COMMUNITY AREA NAME')['HARDSHIP INDEX'].values
top_venues.head()
# Group by Hardship Index: venue counts are summed within each hardship category
top_venues_grouped = top_venues.groupby('hardship_index').sum()
# Display the information in a heatmap
# (after the groupby, hardship_index is the index and every column is a venue category)
plt.figure(figsize=(22,8))
sns.heatmap(top_venues_grouped)
plt.title('Distribution of Top Venues in the Chicago communities')
plt.show()
# Define the data to be scaled (every column is a venue count after the groupby)
X = top_venues_grouped.values
# Set up the scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Transform the data: each venue category is scaled to [0, 1] across the hardship bins
X = scaler.fit_transform(X)
# Create a new DataFrame with the scaled data
scaled_top_venues = pd.DataFrame(X, index=top_venues_grouped.index, columns=top_venues_grouped.columns)
plt.figure(figsize=(22,8))
sns.heatmap(scaled_top_venues)
plt.title('Scaled distribution of Top Venues in the Chicago communities')
plt.savefig('img/heatmap.jpg')
A chi-squared test on each hardship bin will tell us whether its venue categories are compatible with a uniform (random) distribution, or whether some categories significantly dominate.
#Let's get the p-values
chisq = chisquare(X, axis=1)
print(f'The p-values for each of the hardship indexes are: {chisq[1]}')
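As a sanity check on how scipy's chisquare behaves here: with no f_exp argument it tests the observed counts against a uniform distribution, so a flat row gives p = 1.0 and a concentrated row gives a small p. A toy example with made-up counts:

```python
from scipy.stats import chisquare

flat = [10, 10, 10, 10]   # perfectly uniform counts
skewed = [25, 5, 5, 5]    # one category dominates

# No f_exp given, so the expected counts are uniform (the row mean)
p_flat = chisquare(flat).pvalue
p_skewed = chisquare(skewed).pvalue
```

Small p-values per hardship bin would therefore indicate that its venue mix is far from uniform.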
# Let's create our training and test samples
X = top_venues[top_venues.columns[0:-1]]
y = community_venues_sorted['hardship_index']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
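With only about 77 communities spread over six hardship bins, a plain random split can leave some bins barely represented in the test set. Passing stratify=y (an option the split above does not use, shown here only as a suggestion) preserves the class proportions. A toy illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(60).reshape(-1, 1)
y_toy = [0] * 40 + [1] * 20  # imbalanced classes, 2:1

# stratify keeps the 2:1 ratio in both halves of the split
_, _, _, y_te = train_test_split(X_toy, y_toy, test_size=0.5,
                                 random_state=0, stratify=y_toy)
```

Without stratification, small classes can end up entirely in one side of the split, which distorts the classification report below.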
# Define the parameters to be used with the grid search
params = {'n_neighbors': range(1,25)}
# Instantiate a grid and train it
grid_knn = GridSearchCV(KNeighborsClassifier(),param_grid=params, cv=5, scoring = 'jaccard_weighted')
grid_knn.fit(X_train,y_train)
# Get the best parameters
print(f'The best parameters for the model are: {grid_knn.best_params_}')
# Get predictions
predict = grid_knn.predict(X_test)
# Evaluate the model's performance
print(classification_report(y_test, predict))
# Get predictions
predict = grid_knn.predict(X)
# Evaluate the model's performance
print(classification_report(y, predict))
# Define the parameters to be used with the grid search
params = {'n_estimators': range(10,25), 'criterion':['gini', 'entropy']}
# Instantiate a grid and train it
grid_rf = GridSearchCV(RandomForestClassifier(),param_grid=params, cv=5, scoring = 'jaccard_weighted')
grid_rf.fit(X_train,y_train)
# Get the best parameters
print(f'The best parameters for the model are: {grid_rf.best_params_}')
# Get predictions
predict = grid_rf.predict(X_test)
# Evaluate the model's performance
print(classification_report(y_test, predict))
# Get predictions with the whole set
predict = grid_rf.predict(X)
# Evaluate the model's performance
print(classification_report(y, predict))
The Chicago communities are quite similar in their venue-type distributions, so venue composition is not a good predictor of the Hardship Index. Unfortunately this means that (at least with the data presented here) there is no clear indication of which kinds of venues could improve a community's socioeconomic conditions. In fact, the analysis shows that some buildings and spaces we might expect to improve community life, such as parks, are more common in communities with a high Hardship Index, suggesting that the problem lies elsewhere.
Mexican restaurants are the top venue in low Hardship Index locations, so if you are looking for some comfort food you can go to your local Mexican deliciousness parlor.
If you're from Chicago, I leave you here a map of Mexican restaurants.
mex_rest = chicago_venues[chicago_venues['Venue Category'] == 'Mexican Restaurant'][['Venue','Venue Latitude', 'Venue Longitude']]
mexican_restaurants = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.Choropleth(geo_data = geojson,
name = 'choropleth',
data = chicago_h_i,
columns= ['COMMUNITY AREA NAME','HARDSHIP INDEX'],
key_on= 'feature.properties.community',
fill_color='YlOrRd',
fill_opacity=0.5,
line_opacity=0.2,
legend_name='Hardship Index').add_to(mexican_restaurants)
for lat, lng, label in zip(mex_rest['Venue Latitude'], mex_rest['Venue Longitude'], mex_rest['Venue']):
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=popup,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.6).add_to(mexican_restaurants)
mexican_restaurants